These exercises will help you practice applying separate, unite, and regular expressions. You will use a messy dataset with information about cardiovascular disease (CVD).
Before starting these exercises, you should have a good understanding of:
- The Tidy your data Primer.
- Chapters 12.3-12.7 and Chapter 14 of R for Data Science.
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
library(magrittr)
# named character vector: names are the variables, values are their descriptions
cvd_messy_descr <- c(
  "ID" = 'Participant identification',
  "question_age" = "Question: how old are you / when were you born? Participants 51 or older answered the second question.",
  "question_substance" = 'Question: do you smoke or drink?',
  "question_bp" = 'Question: what is your blood pressure? Are you taking medications to lower your blood pressure?',
  "labs" = 'A collection of laboratory values concatenated into a single string. Notably, the order of lab values is random',
  "cvd_fup" = 'Report of whether this participant experienced a cardiovascular disease event (i.e., stroke or coronary heart disease) after their interview'
)
# the enframe function transforms a named vector into a tibble
# with two columns, name and value
tibble::enframe(cvd_messy_descr) %>%
  gt::gt(rowname_col = "name") %>%
  gt::tab_stubhead(label = 'Variable name') %>%
  gt::cols_label(value = 'Variable description') %>%
  gt::cols_align('left') %>%
  gt::tab_header(title = 'Description of messy cardiovascular disease data')
Description of messy cardiovascular disease data

| Variable name | Variable description |
|---|---|
| ID | Participant identification |
| question_age | Question: how old are you / when were you born? Participants 51 or older answered the second question. |
| question_substance | Question: do you smoke or drink? |
| question_bp | Question: what is your blood pressure? Are you taking medications to lower your blood pressure? |
| labs | A collection of laboratory values concatenated into a single string. Notably, the order of lab values is random |
| cvd_fup | Report of whether this participant experienced a cardiovascular disease event (i.e., stroke or coronary heart disease) after their interview |
cvd_messy <- readr::read_rds('data/cvd_messy.rds')
cvd_messy
Tidy the data up to create the following columns:
- ID: (numeric) participant identification
- cvd_status: (numeric) 0 if no CVD, 1 if CVD
- cvd_time: (numeric) years from interview to CVD or loss to follow-up
- sbp: (numeric) systolic blood pressure, mm Hg
- dbp: (numeric) diastolic blood pressure, mm Hg
- bp_meds: (factor) Yes/No for use of blood pressure lowering medication
- age_number: (numeric) age in years
- drink: (factor) Yes/No for drinking
- smoke: (factor) Yes/No for smoking
- albumin: (numeric) albumin levels
- hba1c: (numeric) HbA1C levels
- creatinine: (numeric) creatinine levels

Each column can be cleaned in a number of different ways.

- To create cvd_time, I recommend making a slight modification to the regular expression we used in the lecture.
- Many of the other variables can be managed with str_detect, str_extract, and str_remove.
- You could also consider converting some variables into factors with new labels and then using separate.
- For the lab values, look up a new function: ?separate_rows. The problem will be much harder if you do not use separate_rows.
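
The hints above can be tried out on toy data first. The sketch below is only an illustration: the strings and delimiters (`;`, `=`, `/`, `meds:`) are made-up assumptions, and the real columns in cvd_messy may be formatted differently, so the patterns will need adapting.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical lab strings: names and values in a random order per participant
toy <- tibble::tibble(
  ID   = c(1, 2),
  labs = c("albumin=4.1;hba1c=5.6;creatinine=0.9",
           "creatinine=1.2;albumin=3.8;hba1c=7.0")
)

toy_wide <- toy %>%
  # one row per lab value, so the original order no longer matters
  separate_rows(labs, sep = ";") %>%
  # split "name=value" into two columns; convert = TRUE makes value numeric
  separate(labs, into = c("lab", "value"), sep = "=", convert = TRUE) %>%
  # one column per lab
  pivot_wider(names_from = lab, values_from = value)

# Hypothetical blood pressure answers, cleaned with stringr helpers
bp <- c("120/80 mm Hg, meds: yes", "135/85 mm Hg, meds: no")
sbp     <- as.numeric(str_extract(bp, "^\\d+"))      # digits at the start
dbp     <- as.numeric(str_extract(bp, "(?<=/)\\d+")) # digits right after "/"
bp_meds <- str_detect(bp, "yes$")                    # TRUE when the answer ends in "yes"
```

Reshaping to one row per lab value before spreading back out is what makes the random ordering harmless: each value travels with its own name.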
Once you are finished, remove the original messy columns and convert any character-valued columns to factors. Your cleaned data should look like this:
readr::read_rds('solutions/01_solution.rds')
Create new columns:
- diabetes (factor) ‘Yes’ if HbA1C is greater than 6.5, ‘No’ if less than or equal to 6.5
- albuminuria (factor) ‘Yes’ if albumin / creatinine is greater than or equal to 30 and ‘No’ if albumin / creatinine is less than 30
- bp_midrange (factor) ‘Yes’ if at least one of the two conditions below is true:
- rec_bpmeds_acc_aha (factor) ‘Yes’ if any of the conditions below are TRUE, and ‘No’ if all of them are FALSE.
  - bp_midrange == ‘Yes’ and albuminuria == ‘Yes’
  - bp_midrange == ‘Yes’ and diabetes == ‘Yes’
  - bp_midrange == ‘Yes’ and age_number > 65
- rec_bpmeds_jnc7 (factor) ‘Yes’ if SBP >= 140 or DBP >= 90, ‘No’ if SBP is less than 140 and DBP is less than 90.
Note: rec_bpmeds_acc_aha is a simplified version of the 2017 American College of Cardiology and American Heart Association’s BP guidelines.
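
As a minimal sketch of the pattern behind these columns, the example below derives two Yes/No factors from numeric values with if_else(). The toy values are invented for illustration; only the thresholds (6.5 for HbA1C, 30 for albumin / creatinine) come from the definitions above. Note that a missing input produces a missing result rather than ‘No’.

```r
library(dplyr)

# Toy values, not taken from the exercise data
toy <- tibble::tibble(
  hba1c      = c(5.4, 6.5, 7.1, NA),
  albumin    = c(10, 90, 40, 25),
  creatinine = c(1, 2, 1, NA)
)

toy_new <- toy %>%
  mutate(
    # strictly greater than 6.5 counts as diabetes
    diabetes    = factor(if_else(hba1c > 6.5, "Yes", "No")),
    # ratio of 30 or more counts as albuminuria
    albuminuria = factor(if_else(albumin / creatinine >= 30, "Yes", "No"))
  )
```

The same approach extends to the recommendation columns by combining conditions with `&` and `|` inside if_else().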
readr::read_rds('solutions/02_solution.rds')
Use count, mutate, glue, and pivot_wider to make the following table summarizing counts and percentages of diabetes, stratified by recommendations to initiate or intensify BP lowering. Remember to group and ungroup the data appropriately.
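
One possible shape for this pipeline, shown on a small invented dataset (the column name rec and all values are placeholders, not the exercise data): count the combinations, compute percentages within each stratum, paste count and percent into one cell with glue, then spread the strata into columns.

```r
library(dplyr)
library(tidyr)
library(glue)

# Toy data: rec stands in for a recommendation column
toy <- tibble::tibble(
  rec      = c("No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  diabetes = c("No", "No", "Yes", "No", "Yes", "Yes", "Yes")
)

toy_tbl <- toy %>%
  count(rec, diabetes) %>%
  # percentages are computed within each level of rec
  group_by(rec) %>%
  mutate(
    pct  = round(100 * n / sum(n)),
    cell = as.character(glue("{n} ({pct}%)"))
  ) %>%
  ungroup() %>%
  select(-n, -pct) %>%
  # one column per level of rec
  pivot_wider(names_from = rec, values_from = cell)
```

Grouping before the mutate step is what makes sum(n) a per-stratum total; without it, the percentages would be taken over the whole dataset.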
readr::read_rds('solutions/03_solution.rds')
You might imagine doing Problem 3 for all variables and then dealing with combining results into a participant characteristics table. Sounds pretty tedious, right? The gtsummary package is here for you. Explore the package website and focus on the tbl_summary() vignette. When you are ready, try using tbl_summary() on the data you created.
Before creating your table, make sure that all of the character variables in your data are converted to factor variables, and that all of your factor variables are given an explicit NA coding such that missing values are given a value of ‘Unknown’.
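
A minimal sketch of that preparation step on toy data follows. It assumes forcats 1.0.0 or later, where fct_na_value_to_level() is available; older versions offer fct_explicit_na() with an na_level argument instead. The column values are invented for illustration.

```r
library(dplyr)
library(forcats)

# Toy data with one character column, one factor column, and missing values
toy <- tibble::tibble(
  smoke = c("Yes", NA, "No"),
  drink = factor(c("No", "Yes", NA)),
  age   = c(50, 61, 47)
)

toy_explicit <- toy %>%
  # step 1: character columns become factors
  mutate(across(where(is.character), as.factor)) %>%
  # step 2: missing values in every factor become the level "Unknown"
  mutate(across(where(is.factor),
                ~ fct_na_value_to_level(.x, level = "Unknown")))
```

Data prepared this way can then be passed to tbl_summary(), using its by argument to choose the stratifying column.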
readr::read_rds('solutions/04_solution.rds')
Column groups: Recommended to initiate or intensify medications to lower BP by the 2017 ACC/AHA guidelines

| Characteristic | No, N = 6082¹ | Yes, N = 2993¹ | Unknown, N = 925¹ |
|---|---|---|---|
| Recommended initiation / intensification by JNC7 | |||
| No | 6082 (100%) | 974 (33%) | 805 (87%) |
| Yes | 0 (0%) | 2019 (67%) | 0 (0%) |
| Unknown | 0 (0%) | 0 (0%) | 120 (13%) |
| Age, years | 51 (43, 61) | 63 (54, 70) | 54 (46, 61) |
| Systolic blood pressure, mm Hg | 118 (111, 125) | 144 (136, 152) | 131 (126, 136) |
| Unknown | 0 | 0 | 120 |
| Diastolic blood pressure, mm Hg | 73 (68, 78) | 80 (73, 87) | 82 (77, 84) |
| Unknown | 0 | 0 | 120 |
| Systolic/diastolic BP 130-140/80-90 mm Hg | |||
| No | 4907 (81%) | 1245 (42%) | 54 (5.8%) |
| Yes | 1175 (19%) | 1748 (58%) | 751 (81%) |
| Unknown | 0 (0%) | 0 (0%) | 120 (13%) |
| Currently using BP lowering medication | |||
| No | 3440 (57%) | 980 (33%) | 394 (43%) |
| Yes | 2642 (43%) | 2013 (67%) | 452 (49%) |
| Unknown | 0 (0%) | 0 (0%) | 79 (8.5%) |
| Alcohol | |||
| No | 3008 (49%) | 1842 (62%) | 436 (47%) |
| Yes | 2982 (49%) | 1113 (37%) | 481 (52%) |
| Unknown | 92 (1.5%) | 38 (1.3%) | 8 (0.9%) |
| Smoking | |||
| No | 5217 (86%) | 2536 (85%) | 786 (85%) |
| Yes | 773 (13%) | 419 (14%) | 131 (14%) |
| Unknown | 92 (1.5%) | 38 (1.3%) | 8 (0.9%) |
| Hemoglobin A1c | 5.60 (5.20, 6.00) | 6.00 (5.50, 6.80) | 5.60 (5.20, 6.00) |
| Unknown | 167 | 113 | 125 |
| Diabetes | |||
| No | 5181 (85%) | 2018 (67%) | 774 (84%) |
| Yes | 734 (12%) | 862 (29%) | 26 (2.8%) |
| Unknown | 167 (2.7%) | 113 (3.8%) | 125 (14%) |
| Albuminuria | |||
| No | 4198 (69%) | 1634 (55%) | 151 (16%) |
| Yes | 49 (0.8%) | 85 (2.8%) | 3 (0.3%) |
| Unknown | 1835 (30%) | 1274 (43%) | 771 (83%) |

¹ Statistics presented: n (%); median (IQR)
The ACC/AHA guideline may recommend initiating or intensifying medication to lower BP for adults with SBP/DBP greater than 130/80. The lower BP threshold has been criticized in many editorials. Using your data, assess the merit of these criticisms:
Create a new dataset where neither rec_bpmeds_acc_aha nor rec_bpmeds_jnc7 has any ‘Unknown’ values.
Create a new variable by uniting the rec_bpmeds_acc_aha and rec_bpmeds_jnc7 columns.
You should have three categories in the new variable. Recode them as follows:
# make sure your dataset is called cvd_model
# coxph() and Surv() come from the survival package
# mdl <- coxph(Surv(cvd_time, cvd_status) ~ rec, data = cvd_model)
Use the tbl_regression() function to summarize your model. Make sure to set exponentiate = TRUE so that tbl_regression() will present hazard ratios. Your table should look like this:
readr::read_rds('solutions/05_solution.rds')
| Characteristic | HR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| Recommendation for BP lowering medications | |||
| Not recommended BP medication by either guideline | — | — | |
| Recommended BP medication by ACC/AHA only | 2.34 | 1.97, 2.78 | <0.001 |
| Recommended BP medication by both guidelines | 1.85 | 1.60, 2.14 | <0.001 |

¹ HR = Hazard Ratio, CI = Confidence Interval
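
The uniting and recoding steps of this last exercise can be sketched on toy data as follows. The category labels are copied from the table above, but the mapping from united values to labels, the `_` separator, and the new column name rec (taken from the model formula) are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data covering the three combinations expected after dropping 'Unknown'
toy <- tibble::tibble(
  rec_bpmeds_acc_aha = c("No", "Yes", "Yes"),
  rec_bpmeds_jnc7    = c("No", "No", "Yes")
)

toy_rec <- toy %>%
  # paste the two columns into one, e.g. "Yes_No"
  unite(col = "rec", rec_bpmeds_acc_aha, rec_bpmeds_jnc7, sep = "_") %>%
  mutate(
    # assumed mapping from united values to the labels shown in the table
    rec = recode(rec,
      "No_No"   = "Not recommended BP medication by either guideline",
      "Yes_No"  = "Recommended BP medication by ACC/AHA only",
      "Yes_Yes" = "Recommended BP medication by both guidelines"
    ),
    rec = factor(rec)
  )
```

With the "either guideline" category as the first factor level, it serves as the reference group when the variable enters a regression model.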